This paper shows that a simple algorithm produces the all-prefixes-LCSs-graph in O(mn) time for
two input sequences of size m and n. Given any prefix p of the first input sequence and any prefix q of
the second input sequence, all longest common subsequences (LCSs) of p and q can be generated in time
proportional to the output size, once the all-prefixes-LCSs-graph has been constructed. The problem can
be solved in the context of generating all the distinct character strings that represent an LCS or in the
context of generating all ways of embedding an LCS in the two input strings.
1 Background and Terminologies

Let A = a_1 a_2 ... a_m and B = b_1 b_2 ... b_n with m ≤ n be two sequences over an alphabet Σ. Any sequence
that can be obtained by deleting some symbols of another sequence is referred to as a subsequence of the
original sequence. A common subsequence of A and B is a subsequence of both A and B. The longest
common subsequence (LCS) problem is to find a common subsequence of greatest possible length.¹

A pair of sequences may have many different LCSs. In addition, a single LCS may have many different 
embeddings, i.e., positions in the two strings to which the characters of the LCS correspond. We may pick 
out a distinguished embedding for each distinct LCS, e.g., the canonical embedding has been defined to be 
the one in which each character, starting from the beginning of the LCS, is assigned matching positions in 
both sequences as small as possible [18]. It is more convenient in this paper to distinguish embeddings in
which the matching positions are chosen as large as possible (starting from the end of the LCS); let us call
these anticanonical embeddings. Figure 1 shows an example pair of strings and the various LCS embeddings
and anticanonical embeddings. (The matrix in the figure will be explained later.) 

A few other terminologies and notations that will be useful are as follows. We use A_i to represent the
prefix a_1 a_2 ... a_i of A and similarly B_j for B. When a_i = b_j, we refer to the pair [i,j] as a match;
otherwise it is a clash.



1 It is reasonable to assume that the alphabet size |Σ| is at most m, since the actual value of symbols not present in the
shorter string is irrelevant. Extraneous symbols can be culled efficiently if space usage is not of concern, or hashing can be used
to obtain a good expected time with little space usage.
Figure 1: Listed are the seven different embeddings and three anticanonical embeddings (corresponding
to the three distinct LCSs) for the strings A = bilabial and B = balaclava. (The naive method of
generating all LCSs for this pair of strings would produce a list of length 100, because there would be many
duplications.) In the matrix, the entry [i,j] shows the rank L[i,j] as per (1). The matches are circled and
are organized into contours as shown by the connecting lines. If a match is dominant, its circle is bold, and if
the match is antidominant, its rank is bold. (Note that a match may be both dominant and antidominant.)



The standard "naive" method of computing the length of an LCS is a "bottom-up" dynamic programming
approach (as in [21]) based on the following recurrence for the length of an LCS of A_i and B_j:

$$
L[i,j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
L[i-1,j-1] + 1 & \text{if } i,j > 0 \text{ and } a_i = b_j \\
\max\{L[i-1,j],\, L[i,j-1]\} & \text{otherwise}
\end{cases}
\tag{1}
$$

(Sankoff [20] may be the first to have published this recurrence, based on the work of Needleman and
Wunsch [?].) In O(mn) time, one may fill an array with all the values of L[i,j] for 0 ≤ i ≤ m ∧ 0 ≤ j ≤ n,
and the length L of an LCS is read off from L[m,n]. The same time bound also suffices to produce a single
LCS by a "backtracing" approach starting from position [m,n] of the array. At each stage we just step from
position [i,j] to a position [i-1,j-1], [i-1,j], or [i,j-1] that is responsible for the setting of L[i,j] as
per (1); each match encountered generates a character of the LCS (in reverse order).
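The recurrence (1) and the single-LCS backtrace can be sketched as follows. This is a minimal illustration in Python, not the C implementation mentioned in the conclusion; the names lcs_table and one_lcs are hypothetical, and the table is padded with a zero row and column so that L[i][j] corresponds to the 1-based position [i,j] used in the text.

```python
def lcs_table(a, b):
    """Fill L[i][j] for 0 <= i <= m, 0 <= j <= n as per recurrence (1)."""
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:          # [i,j] is a match
                L[i][j] = L[i - 1][j - 1] + 1
            else:                             # [i,j] is a clash
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L


def one_lcs(a, b):
    """Backtrace from [m,n], collecting one LCS in reverse order."""
    L = lcs_table(a, b)
    i, j = len(a), len(b)
    out = []
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])              # a match contributes a character
            i, j = i - 1, j - 1
        elif L[i - 1][j] == L[i][j]:
            i -= 1                            # rank was inherited from above
        else:
            j -= 1                            # rank was inherited from the left
    return "".join(reversed(out))
```

For the strings of Figure 1, the table gives L[8][9] = 4, and the backtrace returns one of the three distinct LCSs; which one depends on the tie-breaking order chosen in the loop.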

A few other terminologies are useful for discussing some alternative solution techniques. Figure 1 shows
the matrix of L values as per (1) for a sample pair of input strings, and we will refer to the value of L[i,j]
as the rank of [i,j]. It is well known and easy to see that the matches can be partitioned by rank so as to
form contours as illustrated by the zig-zag lines in Figure 1. Starting from the lower left match on a contour,
motion along a contour proceeds monotonically in both dimensions, i.e., the next match is at or above the
level of the previous match and at or to the right of the vertical position of the previous match. Different
contours never cross or touch. Each contour may be completely specified by the dominant matches in the
upper left corners of the contours, i.e., those matches [i*,j*] for which there is no other match [i',j'] on the
same contour with i' = i* ∧ j' < j* or j' = j* ∧ i' < i*. For discussion of the algorithm to be presented
in Section 2, we also introduce the notion of antidominant matches, i.e., those matches [i*,j*] for which
there is no other match [i',j'] on the same contour with i' = i* ∧ j' > j* or j' = j* ∧ i' > i*. Note also
that two matches [i*,j*] and [i',j'] with i* < i' can belong to the same common subsequence if and only if
i* < i' ∧ j* < j'. Thus, the problem of finding an LCS can be expressed as finding a longest sequence of
matches that is strictly increasing in both dimensions.
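The definitions of contours, dominant matches, and antidominant matches can be checked directly on small inputs. The sketch below is a brute-force illustration that applies the definitions literally; it is not an efficient method for finding dominant matches, and the name classify_matches is hypothetical. Positions are 1-based pairs (i, j).

```python
def lcs_table(a, b):
    """Rank table L as per recurrence (1), padded with a zero row and column."""
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L


def classify_matches(a, b):
    """Return (contours, dominant, antidominant) by direct use of the definitions."""
    L = lcs_table(a, b)
    contours = {}                                  # rank -> set of matches of that rank
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                contours.setdefault(L[i][j], set()).add((i, j))
    dominant, antidominant = set(), set()
    for matches in contours.values():
        for (i, j) in matches:
            # Columns of other same-contour matches in row i, and rows of those in column j.
            same_row = [y for (x, y) in matches if x == i and y != j]
            same_col = [x for (x, y) in matches if y == j and x != i]
            if all(y > j for y in same_row) and all(x > i for x in same_col):
                dominant.add((i, j))
            if all(y < j for y in same_row) and all(x < i for x in same_col):
                antidominant.add((i, j))
    return contours, dominant, antidominant
```

With A = B = aa, the rank-1 contour holds [1,1], [1,2], and [2,1]; only [1,1] is dominant on it, while [1,2] and [2,1] are antidominant. The lone rank-2 match [2,2] is both.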
The best known upper bound on the time to find an LCS with general inputs (i.e., with the time expressed
only in terms of m and n) is essentially O(mn / lg n) [14] with a finite alphabet or slightly more with an infinite
alphabet [17], only a small improvement over the naive method. Several other methods have been proposed
to reduce the time under such circumstances as small alphabet, short LCS, or few dominant matches, e.g.,
[?], [?], [15], [12], [?], [?], [?], [?], [?], [19]. For all of these algorithms, however, there are still inputs
that require Ω(m^2) time or more.² Thus, the naive method remains a reasonable approach for finding one LCS,
particularly in light of its simplicity.

Relatively little attention has been given to the problem of finding all LCS embeddings or all distinct
LCSs. (In the latter case, different embeddings of the same character sequence would not be counted as
different LCSs.) The naive approach to generate all LCS embeddings [?] would be to extend the backtracing
method. At each step, we would consider three possibilities (and continue recursively); from position [i,j]
we could add a character to the LCS and move to [i-1,j-1] if [i,j] is a match, and we could move to
[i-1,j] or [i,j-1] if the L value there equals L[i,j] (without adding a character to the LCS and regardless
of whether [i,j] is a match). One could obviously then remove multiple embeddings of the same LCS to
obtain a list of all distinct LCSs. This naive approach to generating all LCS embeddings or all distinct LCSs
could, however, be painfully inefficient. The naive method may traverse exponentially many paths through
the L matrix even when only one LCS embedding exists. Furthermore, any method of generating distinct
LCSs that begins by generating all LCS embeddings could have a run time exceeding the output size by a
factor of approximately 2.598^n as per the maximum number of different embeddings a single LCS could
have in two sequences of length n [10].
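The naive enumeration just described can be sketched as follows, which makes the duplication visible. This is an illustration under stated assumptions: the name naive_embeddings is hypothetical, a path is recorded as soon as rank 0 is reached (the stopping rule is a choice of this sketch, so the raw path count need not agree with the figure quoted for Figure 1), and positions are 1-based pairs.

```python
def lcs_table(a, b):
    """Rank table L as per recurrence (1), padded with a zero row and column."""
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L


def naive_embeddings(a, b):
    """Return one tuple of matches per traversed path (duplicates included)."""
    L = lcs_table(a, b)
    paths = []

    def walk(i, j, acc):
        if L[i][j] == 0:                       # nothing left to match
            paths.append(tuple(reversed(acc)))
            return
        if a[i - 1] == b[j - 1]:               # take the match, step diagonally
            walk(i - 1, j - 1, acc + [(i, j)])
        if L[i - 1][j] == L[i][j]:             # same rank above
            walk(i - 1, j, acc)
        if L[i][j - 1] == L[i][j]:             # same rank to the left
            walk(i, j - 1, acc)

    walk(len(a), len(b), [])
    return paths
```

Removing duplicates from the returned paths for A = bilabial and B = balaclava leaves the seven embeddings of Figure 1, and mapping each embedding to its characters leaves the three distinct LCSs blaa, blal, and baal.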

Rick [18] gives a method to produce a compact representation of all LCS embeddings, the LCSs-graph,
from which all LCS embeddings can be listed in time proportional to the output size. He also notes that
an extra processing stage can prune the compact representation to one that gives only distinct LCSs. The
time complexity of his algorithm for constructing the LCSs-graph G is O(|Σ|n + T + |G|), where T is
the time of any algorithm that determines the dominant matches. Thus, the run time of his algorithm is
sometimes better than O(mn) but certainly could require O(mn) for some inputs. Furthermore, the number
of distinct LCSs could be as large as approximately 1.442^n for two sequences of length n [?], so actually
listing all distinct LCSs (or all LCS embeddings) may well erase any gain from constructing the LCSs-graph
in less than O(mn) time. Baeza-Yates [?] provides another construction from which the LCSs-graph could
be produced but with a potentially longer time of at least O(|Σ| n lg n). O(mn) algorithms for creating a
structure akin to the LCSs-graph have also been proposed by Gotoh [?] and Altschul and Erickson [?] but
with much greater complication than the approach to be presented here.

This paper shows that a much simpler approach than in prior work can be used to perform the
preprocessing phase in O(mn) time. Furthermore, the result of this preprocessing phase is a more versatile
structure, the all-prefixes-LCSs-graph. From the all-prefixes-LCSs-graph, we can list, for any prefix A_i of the
first input string and any prefix B_j of the second input string, all LCSs of A_i and B_j in time proportional
to the size of the output. The all-prefixes-LCSs-graph can be constructed either for distinct LCSs or for all
LCS embeddings. The more interesting case of distinct LCSs is discussed in Section 2. The simpler case of
all LCS embeddings is discussed in Section 3.



2 Finding All Distinct Longest Common Subsequences 

The basic methodology employed here for enabling efficient computation of all LCSs of any prefixes A_i and
B_j of the input sequences is a variation on the idea of creating a directed acyclic graph in which every
path from the vertex corresponding to [i,j] represents a different LCS of A_i and B_j. The naive backtracing
approach mentioned in Section 1 can be thought of as directing edges from [i,j] to some or all of [i-1,j-1],
[i-1,j], and [i,j-1] (according to ranks of the vertices and whether [i,j] is a match). Then every path
from [i,j] would represent an LCS. But, as mentioned before, many paths may be traversed for the same
LCS or even the same LCS embedding; also, throughout the enumeration of paths, many steps may be taken
in which no match is added to the current LCS.

2 An example with many dominant matches is when one input string contains repeated occurrences of the pattern abc and
the other contains repeated occurrences of the pattern cba.

Rick's compact representation of all LCS embeddings in A and B, the LCSs-graph [18], may be defined
as follows (though it is constructed in a much more efficient fashion than this definition suggests). Find the
transitive closure of the naive graph, remove all but those vertices that are matches belonging to some LCS,
and remove all edges except those connecting a retained vertex to a retained vertex of next lower rank. (Rick
actually reverses the direction of every edge, but declining to do so leaves us in a framework more analogous
to the naive backtracing approach.) It is easy to see that the LCSs-graph can be used to list all LCS
embeddings in time proportional to the output size. Furthermore, Rick notes that a breadth first search on
the graph can be used to eliminate certain vertices, leaving a representation of only canonical embeddings.
As noted before, Rick's construction of the LCSs-graph is relatively complex, and it is predicated upon
having first found all dominant matches. Furthermore, it is a compact representation only of the LCSs of A
and B rather than of the LCSs for each of the mn pairs of prefixes A_i and B_j.

We show here a simple construction of the all-prefixes-LCSs-graph, which can be initially thought of as
being similar to Rick's LCSs-graph but without restricting attention to vertices that are matches belonging
to an LCS of A and B. Furthermore, we show how to prune the edges on the fly so that only the single
anticanonical embedding is represented for each distinct LCS. Though the number of edges in the
all-prefixes-LCSs-graph may exceed Θ(mn), we can still produce essentially an adjacency list representation
in O(mn) time, due to a heavy degree of sharing among the adjacency lists of different vertices.

The precise definition of the all-prefixes-LCSs-graph is as follows. Every vertex [i,j] has an edge pointing
to each match [i*,j*] of the same rank that is antidominant when considering the input strings A_i and B_j
(i.e., such that i* ≤ i ∧ j* ≤ j, and there is no other match [i',j'] of the same rank with i' = i* ∧ j ≥ j' > j* or
j' = j* ∧ i ≥ i' > i*). In Rick's graph, vertices point to vertices of one lower rank, but the algorithm presented
here to generate the all-prefixes-LCSs-graph is simplified by having vertices point to vertices of equal rank.
It is still easy to use essentially the same backtracing method to list the (reversed) LCSs corresponding to
any starting point [i,j]; we simply need to augment the explicit edges of the graph with the notion that whenever
we include a match in the LCS, we take a diagonal step (subtracting one from each matrix coordinate).

Before continuing, we should verify that the backtraces from [i,j] in the all-prefixes-LCSs-graph will
actually represent each distinct LCS once.

Theorem 1 Considering all paths from [i,j] in the all-prefixes-LCSs-graph (augmented with diagonal steps
from match nodes) provides a one-to-one correspondence with distinct LCSs of A_i and B_j.

Proof. This is easiest to see by recalling that finding all LCS embeddings (in reverse order) corresponds to
finding all longest sequences of matches (in the submatrix defined by A_i and B_j) that are strictly decreasing
in both dimensions.

The restricted use of matches that we incorporated into the all-prefixes-LCSs-graph will not make us
lose any LCSs, by the following reasoning. If we consider a backtrace in which we go from position [i,j] to
a match [i*,j*] such that there exists another match [i',j'] of the same rank with i' = i* ∧ j ≥ j' > j* or
j' = j* ∧ i ≥ i' > i*, then we can just replace [i*,j*] with [i',j'] and get the same LCS. ([i*,j*] and [i',j']
must match on the same character, since they are in the same row or column.)

We also will not duplicate any LCSs, by the following reasoning. To get the same LCS twice, there would
need to be a position [i,j] from which two edges in the all-prefixes-LCSs-graph proceed to two matches on the
same character, say [i*,j*] and [i',j'] with i' > i*. Then [i',j*] would be another match on the same character
and of the same rank, implying that [i*,j*] and [i',j'] are not antidominant for A_i and B_j. ■
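Theorem 1 can be checked on small inputs by building each adjacency list directly from the definition and following the augmented backtraces. The sketch below is a brute-force illustration only: it recomputes every adjacency list from scratch and therefore does not achieve the O(mn) construction time of the algorithm in this paper. The names adjacency and distinct_lcs are hypothetical.

```python
def lcs_table(a, b):
    """Rank table L as per recurrence (1), padded with a zero row and column."""
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L


def adjacency(a, b, L, i, j):
    """Matches of rank L[i][j] that are antidominant with respect to A_i and B_j."""
    r = L[i][j]
    if r == 0:
        return []
    cands = {(x, y) for x in range(1, i + 1) for y in range(1, j + 1)
             if a[x - 1] == b[y - 1] and L[x][y] == r}

    def antidominant(x, y):
        # No same-rank match further right in row x or further down in column y.
        return (not any((x, y2) in cands for y2 in range(y + 1, j + 1)) and
                not any((x2, y) in cands for x2 in range(x + 1, i + 1)))

    return sorted(p for p in cands if antidominant(*p))


def distinct_lcs(a, b, i, j):
    """List the LCSs of A_i and B_j, one per path from vertex [i,j]."""
    L = lcs_table(a, b)

    def walk(i, j):
        if L[i][j] == 0:
            yield ""
            return
        for (x, y) in adjacency(a, b, L, i, j):
            for prefix in walk(x - 1, y - 1):  # diagonal step after taking a match
                yield prefix + a[x - 1]

    return list(walk(i, j))
```

For A = bilabial and B = balaclava, the paths from [8,9] yield blal, baal, and blaa exactly once each, in agreement with Figure 1.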

Now, each adjacency list in the all-prefixes-LCSs-graph can be thought of as being based upon a linked
list of the antidominant matches along one of the contours; let us call such a linked list a contour list. That
is, the antidominant matches for A and B are nearly sufficient to characterize all distinct LCSs for each A_i
and B_j (whereas the dominant matches do not suffice, as is illustrated by Rick [?]). We need only add to
the adjacency list for [i,j] at most two matches that are not antidominant for A and B, by considering the
extreme lower left and upper right matches [i*,j*] on the relevant contour satisfying i* ≤ i ∧ j* ≤ j. Thus,
each adjacency list in the all-prefixes-LCSs-graph is a portion of a contour list, with the possible addition
of a different head and/or tail node. An adjacency list that has a separate head node and then jumps into
the midst of a contour list can easily share all but a constant amount of its storage with the contour list;
see Figure 2. When adjacency lists digress to incorporate a separate tail node, however, we require a small
departure from the standard linked list representation. It suffices to maintain, for each adjacency list, a
pointer to the second to last node pretail[i,j] as well as a pointer to the last node tail[i,j], where tail[i,j]
may not actually be the target of an ordinary linked list next node pointer from pretail[i,j]. (See Figure 2.)

Figure 2: Adjacency lists of different nodes in the all-prefixes-LCSs-graph might share portions of a contour
list. In this example, the contour list could be specified by the central line of "next node" pointers, while
the adjacency lists of [i,j] and [i',j'] are largely just excerpts. The two thin diagonal lines are not actual
pointers present in the data structure; rather they show the implicit relationships that pretail[i,j] is followed
by tail[i,j] in the adjacency list for [i,j], and similarly for [i',j'].

With the above representation in mind, we can adapt the naive method of calculating LCS length based
on (1) to also set up the appropriate head[i,j], pretail[i,j], and tail[i,j] pointers based on the information
already computed at positions [i-1,j-1], [i-1,j], and [i,j-1]. We always use tail[i,j] to point to the
last node on the adjacency list at position [i,j] or assign null for an empty adjacency list. Any other nodes
in the adjacency list appear in an ordinary linked list beginning at head[i,j] and terminating at pretail[i,j],
with head[i,j] being null if there are no such nodes.
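Reading an adjacency list out of this representation can be sketched as follows. This is a minimal illustration of the traversal convention only, not of the construction; the class Node, the function adjacency_nodes, and the sample coordinates used in the note after the code are hypothetical.

```python
class Node:
    """A linked list node standing for the match at [row, col]."""

    def __init__(self, row, col):
        self.row = row
        self.col = col
        self.next = None        # ordinary "next node" pointer along a contour list


def adjacency_nodes(head, pretail, tail):
    """Yield the nodes of one adjacency list given its three pointers.

    The nodes from head through pretail are reached by ordinary next pointers.
    The tail node is yielded last, whether or not pretail.next points to it.
    """
    node = head
    while node is not None:
        yield node
        if node is pretail:     # stop here; pretail.next may belong to another list
            break
        node = node.next
    if tail is not None:
        yield tail
```

Suppose a contour list links the matches (6,1), (4,2), (3,3), and (1,7) in that order, and a vertex [4,5] has head at (4,2), pretail at (3,3), and a separate tail node for (1,5). The traversal then yields (4,2), (3,3), (1,5), even though the next pointer of (3,3) leads to (1,7). An adjacency list with a single node has a null head and only a tail.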

The algorithm to construct the all-prefixes-LCSs-graph in O(mn) time is given in Figure 3. We proceed
through positions [i,j] in a row-by-row fashion, with trivial handling for matches in line 5 and handling of
a clash in lines 6 to 17.

When [i,j] is a match, the corresponding vertex in the all-prefixes-LCSs-graph just points to itself; any
other match of the same rank must have a higher row or column index. Consistent with the adjacency list
approach described above, we use tail[i,j] for the pointer to self and leave head[i,j] null to indicate there
is nothing else in the adjacency list of [i,j].

When [i,j] is a clash, backtraces in the naive graph need only proceed through whichever of [i-1,j] and
[i,j-1] are of the same rank as [i,j]. To avoid following duplicate paths that uncover the same embedding
or uncovering multiple embeddings of a single LCS, we perform an appropriate merging of information
computed at [i-1,j] and [i,j-1]. The adjacency lists are always maintained so that the nodes are in order
along a contour from lower left to upper right, and the adjacency lists at [i-1,j] and [i,j-1] are identical
(if [i-1,j] and [i,j-1] are of the same rank) except for possibly a different head and/or tail. Once we
conditionally set the tail of the adjacency list for position [i,j] from the information at [i-1,j] in line 7, we
need only look for additional information at position [i,j-1] in lines 8 to 17.

If a null tail[i,j] was obtained by looking at position [i-1,j], then the adjacency list for [i,j] contains at
most one node; in that case tail[i,j] is copied from position [i,j-1], and our work is done. Otherwise, we
can tentatively form the adjacency list of [i,j] by just following the adjacency list of [i,j-1] with tail[i,j].
But we need lines 11 to 13 to ensure that each LCS is only represented once; these lines strip from this
tentative adjacency list any matches that are duplicative or are not antidominant. If the condition of line 11
is satisfied, then tail[i,j-1] must be stripped out; in any case, pretail[i,j] can be given the correct value
for the case that the adjacency list of [i,j] contains more than just tail[i,j]. If the condition of line 13 is
satisfied, the tentative value of tail[i,j] must be stripped out; since it was not stripped out when considering
position [i-1,j], tail[i,j] must be in row i, and only a single node remains in the adjacency list for [i,j], so
that our work is again done. Otherwise, we set the next linked list node of pretail[i,j] to tail[i,j] in line 15,
and we set head[i,j] in lines 16 to 17. (The explicit link from pretail[i,j] to tail[i,j] will not be used when
processing the adjacency list of [i,j]; it may be overwritten when a higher value of j is considered, or it may
then become relevant due to pretail advancing along the relevant contour list. As written, the algorithm does
not guarantee full construction of all contour lists, but it constructs the portions needed in the adjacency
lists of the all-prefixes-LCSs-graph.)

 1  for i = 1 to m do for j = 1 to n do
 2      Create a linked list node p[i,j] to represent vertex [i,j] in adjacency lists.
 3      Set head[i,j], pretail[i,j], and tail[i,j] to null.
 4      Compute L[i,j] as per (1).
 5      if [i,j] is a match then tail[i,j] ← p[i,j]
 6      if [i,j] is a clash and L[i,j] > 0 then
 7          if L[i-1,j] = L[i,j] then tail[i,j] ← tail[i-1,j]
 8          if L[i,j-1] = L[i,j] then
 9              if tail[i,j] = null then tail[i,j] ← tail[i,j-1]
10              else
11                  if tail[i,j-1] and tail[i,j] point to same row then pretail[i,j] ← pretail[i,j-1]
12                  else pretail[i,j] ← tail[i,j-1]
13                  if pretail[i,j] and tail[i,j] point to same column then tail[i,j] ← pretail[i,j]
14                  else
15                      Set "next" pointer of pretail[i,j] to tail[i,j]
16                      head[i,j] ← head[i,j-1]
17                      if head[i,j] = null then head[i,j] ← pretail[i,j]
18  endfor endfor

Figure 3: The algorithm to create the all-prefixes-LCSs-graph representing all distinct LCSs.



3 Finding All Longest Common Subsequence Embeddings 

If we wish to find all LCS embeddings, without duplicating the same embedding, but including multiple
embeddings of the same LCS, we can follow an approach similar to that of Section 2 but with some
simplification. We can now have each vertex [i,j] in the all-prefixes-LCSs-graph point to all matches [i*,j*] of the
same rank with i* ≤ i ∧ j* ≤ j. It is no longer necessary to incorporate pretail information to specify the
adjacency lists; rather, each adjacency list is simply an excerpt of a contour list, and each can be specified
with a head and tail pointer.
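The embeddings version of the graph can likewise be checked against its definition on small inputs. The sketch below is a brute-force illustration that rescans the submatrix at every vertex instead of sharing contour lists, so it does not reflect the O(mn) construction; the name all_embeddings is hypothetical. Each embedding is returned as a tuple of 1-based matches.

```python
def lcs_table(a, b):
    """Rank table L as per recurrence (1), padded with a zero row and column."""
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L


def all_embeddings(a, b, i, j):
    """List every LCS embedding of A_i and B_j, one per path from vertex [i,j]."""
    L = lcs_table(a, b)

    def walk(i, j):
        r = L[i][j]
        if r == 0:
            yield ()
            return
        # Vertex [i,j] points to every match of rank r with x <= i and y <= j.
        for x in range(1, i + 1):
            for y in range(1, j + 1):
                if a[x - 1] == b[y - 1] and L[x][y] == r:
                    for rest in walk(x - 1, y - 1):   # diagonal step from the match
                        yield rest + ((x, y),)

    return list(walk(i, j))
```

For A = bilabial and B = balaclava, the paths from [8,9] give seven embeddings with no repeats, as listed in Figure 1. For the prefixes bil and bal, the single embedding consists of the matches [1,1] and [3,3].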

It is now relatively easy to see that this version of the all-prefixes-LCSs-graph can be constructed in
O(mn) time as per the algorithm in Figure 4. We now use line 5 to set self-pointers for a match. Then,
regardless of whether [i,j] is a match, we merge the information there with information at whichever of
[i-1,j] and [i,j-1] is of the same rank. In lines 7 to 10, the adjacency list for [i,j] becomes that of [i-1,j]
with [i,j] possibly added on front. In lines 12 to 14, the adjacency list for [i,j-1] (known to be nonempty due
to the test in line 11) is merged onto the front of the adjacency list for [i,j] (taking into account possible
overlap).

 1  for i = 1 to m do for j = 1 to n do
 2      Create a linked list node p[i,j] to represent vertex [i,j] in adjacency lists.
 3      Set head[i,j], pretail[i,j], and tail[i,j] to null.
 4      Compute L[i,j] as per (1).
 5      if [i,j] is a match then head[i,j] ← p[i,j] and tail[i,j] ← p[i,j]
 6      if L[i,j] > 0 then
 7          if L[i-1,j] = L[i,j] then
 8              if head[i,j] = null then head[i,j] ← head[i-1,j]
 9              else Set "next" pointer of head[i,j] to head[i-1,j]
10              tail[i,j] ← tail[i-1,j]
11          if L[i,j-1] = L[i,j] then
12              if tail[i,j] is in row i then set "next" pointer of tail[i,j-1] to head[i,j]
13              head[i,j] ← head[i,j-1]
14              if tail[i,j] = null then tail[i,j] ← tail[i,j-1]
15  endfor endfor

Figure 4: The algorithm to create the all-prefixes-LCSs-graph representing all LCS embeddings.



4 Conclusion 

Simple O(mn) algorithms have been presented to produce the all-prefixes-LCSs-graph in either the context of
representing all distinct LCSs or of representing all LCS embeddings. Once this graph is constructed, we can
list all the LCSs (or LCS embeddings) of prefixes A_i and B_j of the two input strings in time proportional to
the output size. A C language implementation closely following the presentation given in this paper can be
found at http://www.cs.luc.edu/~rig/lcs. Also included is an implementation of the naive backtracing
method followed by removal of duplicate LCSs (or duplicate embeddings). Results for many examples are
included, with the same final list of LCSs (or LCS embeddings) resulting from the algorithms in this paper
or the naive method followed by removal of duplicates.